Slovak National Corpus tools and resources

Author

  • Radovan Garabík

Abstract

The article presents the current state of several projects conducted by the Slovak National Corpus department of the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. We describe the Slovak National Corpus, the Corpus of Spoken Slovak, tools used for linguistic analysis, and an ongoing effort to create a Slovak WordNet.

1 Slovak National Corpus

The Slovak National Corpus is a large, representative corpus of modern written Slovak (since the 1953 orthography reform). Currently, the whole corpus contains over 700 million tokens. There are several specialised subcorpora (fiction, professional texts, journalistic texts, original Slovak fiction, a balanced subcorpus, texts written until 1989). The corpus is automatically lemmatised and morphologically annotated, and it is indexed using the Manatee software [Ryc00].

There are two ways to query the corpus. First, users can run the multiplatform (Tcl/Tk) Bonito client, which accesses the Manatee server using its own protocol. This approach gives users complete access to all the advanced querying, sorting and statistical features of the server, but it requires installing specialised software. The other possibility is web-based access, where only basic features are present. In both cases, the search interface provides CQL-compatible query syntax.

However, in the last few years the ability of an average user to install arbitrary software (and to use anything that is not web-based) has declined considerably, and new corpus users often face an insurmountable obstacle in downloading, unpacking and running the Bonito client. Because of this, we are considering transferring the corpus to Manatee-2, which provides a complete web-based interface as a replacement for the Tcl/Tk client.

A separate corpus (although part of the whole Slovak National Corpus project) is a manually morphologically annotated corpus, whose main purpose is to serve as a source of training data for a Slovak language tagger (and, to a lesser extent, for morphological annotation tools). The Slovak National Corpus source archives take up 46 GB; however, a substantial percentage of this consists of original scan images (when converted into raw XML text, the size is about 6 GB uncompressed).
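To illustrate the CQL-compatible query syntax mentioned above, the following minimal sketch assembles query strings of the kind that could be pasted into a corpus search interface. The helper names `cql_token` and `cql_sequence` are hypothetical, the attribute names `lemma` and `tag` are assumed to match the corpus annotation layers, and the tag pattern `S.*` is only a placeholder for whatever the tagset actually uses. The sketch builds strings only; it does not talk to a Manatee server.

```python
def cql_token(**attrs):
    """Build one CQL token expression, e.g. [lemma="byť" & tag="V.*"].

    Attribute names (lemma, tag, word) are assumptions about the corpus
    configuration; values are treated as regular expressions by CQL.
    """
    if not attrs:
        return "[]"  # an empty expression matches any single token
    parts = []
    for name, pattern in attrs.items():
        if '"' in pattern:
            # Keep the sketch simple: refuse values that would need escaping.
            raise ValueError("double quotes in patterns are not supported here")
        parts.append(f'{name}="{pattern}"')
    return "[" + " & ".join(parts) + "]"


def cql_sequence(*tokens):
    """Join token expressions into a query for adjacent tokens."""
    return " ".join(tokens)


# Hypothetical query: the lemma "slovenský", then any token,
# then a token whose tag starts with "S" (placeholder pattern).
query = cql_sequence(
    cql_token(lemma="slovenský"),
    cql_token(),
    cql_token(tag="S.*"),
)
print(query)
```

Building queries from named attributes rather than by hand keeps the bracket and quote syntax consistent, which is the part new users most often get wrong.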

Related resources

5th Workshop on Intelligent and Knowledge oriented Technologies

The article presents current state of affairs in several projects conducted by the Slovak National Corpus department of the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. We describe the Slovak National Corpus, Corpus of Spoken Slovak, tools used for linguistics analysis and an ongoing effort to create Slovak WordNet. 1 Slovak National Corpus The Slovak National Corpus is a huge...

Comparison of two different techniques of warfarin dosing determination - A chemometrics study

A high prevalence of genetic polymorphisms increases sensitivity to warfarin therapy. In this study, we investigated 47 patients with effective long-term therapy by warfarin well-controlled by monitoring of International Normalised Ratio (INR). All patients were tested for gene polymorphisms VKORC1, CYP2C9*C2, and CYP2C9*C3, which were used for a dose calculation employing a program www.Warfari...

TUKE-BNews-SK: Slovak Broadcast News Corpus Construction and Evaluation

This article presents an overview of the existing acoustic corpora suitable for the automatic broadcast news transcription task in the Slovak language. The TUKE-BNews-SK database created in our department was built to support application development for automatic broadcast news processing and spontaneous speech recognition of the Slovak language. The audio corpus is composed of 479 Slovak TV...

Improving SMT by Using Parallel Data of a Closely Related Language

The amount of training data in statistical machine translation critically affects translation quality. In this paper, we demonstrate how to increase translation quality for one language pair by introducing parallel data from a closely related language. Specifically, we improve English→Slovak translation using a large Czech–English parallel corpus and a shallow MT system for Czech→Slovak transl...

Publication date: 2010